Open Tabs
- 08-visualization-plotly.ipynb
- 09-visualization-seaborn.ipynb
Kernels
- 17-ts-core.ipynb
- 031-data-wrangling-with-mongodb.ipynb
- 04-pandas-advanced.ipynb
- 18-ts-models.ipynb
- 11-databases-mongodb.ipynb
- 10-databases-sql.ipynb
- 033-autoregressive-models.ipynb
- 035-assignment.ipynb
- 032-linear-regression-with-time-series-data.ipynb
- 034-arma-models-and-hyperparameter-tuning.ipynb
- 01-python-getting-started.2022-12-23T07-21-48-609Z.ipynb
- 02-python-advanced.ipynb
- 03-pandas-getting-started.2022-12-23T07-21-48-609Z.ipynb
- 04-pandas-advanced.2022-12-23T07-21-48-609Z.ipynb
- 05-pandas-summary-statistics.2022-12-23T07-21-48-609Z.ipynb
- 06-visualization-matplotlib.2022-12-23T07-21-48-609Z.ipynb
- 07-visualization-pandas.ipynb
- 08-visualization-plotly.ipynb
- 09-visualization-seaborn.ipynb
Terminals
- .ipynb_checkpointsan hour ago
- data2 months ago
- 01-python-getting-started.2022-12-23T07-21-48-609Z.ipynban hour ago
- 01-python-getting-started.ipynb2 months ago
- 02-python-advanced.ipynb2 hours ago
- 03-pandas-getting-started.2022-12-23T07-21-48-609Z.ipynb2 months ago
- 03-pandas-getting-started.ipynb2 months ago
- 04-pandas-advanced.2022-12-23T07-21-48-609Z.ipynban hour ago
- 04-pandas-advanced.ipynb2 months ago
- 05-pandas-summary-statistics.2022-12-23T07-21-48-609Z.ipynban hour ago
- 05-pandas-summary-statistics.ipynb2 months ago
- 06-visualization-matplotlib.2022-12-23T07-21-48-609Z.ipynb32 minutes ago
- 06-visualization-matplotlib.ipynb2 months ago
- 07-visualization-pandas.ipynb7 minutes ago
- 08-visualization-plotly.ipynb2 minutes ago
- 09-visualization-seaborn.ipynba month ago
- 10-databases-sql.ipynb2 months ago
- 11-databases-mongodb.ipynb19 days ago
- 12-ml-core.ipynb2 months ago
- 13-ml-data-pre-processing-and-production.ipynba month ago
- 14-ml-classification.ipynb2 months ago
- 15-ml-regression.ipynba month ago
- 16-ml-unsupervised-learning.ipynb2 months ago
- 17-ts-core.ipynb2 months ago
- 18-ts-models.ipynb2 months ago
- 19-linux-command-line.ipynb2 months ago
- 20-statistics.ipynb2 months ago
- 21-python-object-oriented-programming.ipynb2 months ago
- 22-apis.ipynb2 months ago
- main.py3 months ago
- 08-visualization-plotly.ipynb
- 09-visualization-seaborn.ipynb
xxxxxxxxxxUsage Guidelines
This lesson is part of the DS Lab core curriculum. For that reason, this notebook can only be used on your WQU virtual machine.
This means:
- ⓧ No downloading this notebook.
- ⓧ No re-sharing of this notebook with friends or colleagues.
- ⓧ No downloading the embedded videos in this notebook.
- ⓧ No re-sharing embedded videos with friends or colleagues.
- ⓧ No adding this notebook to public or private repositories.
- ⓧ No uploading this notebook (or screenshots of it) to other websites, including websites for study resources.
xxxxxxxxxx<font size="+3"><strong>Visualizing Data: plotly express</strong></font>Visualizing Data: plotly express
xxxxxxxxxxThere are many ways to interact with data, and one of the most powerful modes of interaction is through **visualizations**. Visualizations show data graphically, and are useful for exploring, analyzing, and presenting datasets. We use four libraries for making visualizations: [pandas](../%40textbook/07-visualization-pandas.ipynb), [Matplotlib](../%40textbook/06-visualization-matplotlib.ipynb), plotly express, and [seaborn](../%40textbook/09-visualization-seaborn.ipynb). In this section, we'll focus on using plotly express.There are many ways to interact with data, and one of the most powerful modes of interaction is through visualizations. Visualizations show data graphically, and are useful for exploring, analyzing, and presenting datasets. We use four libraries for making visualizations: pandas, Matplotlib, plotly express, and seaborn. In this section, we'll focus on using plotly express.
xxxxxxxxxx# Scatter PlotsScatter Plots¶
xxxxxxxxxxA **scatter plot** is a graph that uses dots to represent values for two different numeric variables. The position of each dot on the horizontal and vertical axis indicates values for an individual data point. Scatter plots are used to observe relationships between variables, and are especially useful if you're looking for **correlations**.A scatter plot is a graph that uses dots to represent values for two different numeric variables. The position of each dot on the horizontal and vertical axis indicates values for an individual data point. Scatter plots are used to observe relationships between variables, and are especially useful if you're looking for correlations.
import pandas as pdmexico_city1 = pd.read_csv("./data/mexico-city-real-estate-1.csv")# clean the data and drop `NaNs`mexico_city1 = mexico_city1.drop( ["floor", "price_usd_per_m2", "expenses", "rooms"], axis=1)mexico_city1 = mexico_city1.dropna(axis=0)mexico_city1.head()| operation | property_type | place_with_parent_names | lat-lon | price | currency | price_aprox_local_currency | price_aprox_usd | surface_total_in_m2 | surface_covered_in_m2 | price_per_m2 | properati_url | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | sell | apartment | |México|Distrito Federal|Cuauhtémoc| | 19.41501,-99.175174 | 2700000.0 | MXN | 2748947.10 | 146154.51 | 61.0 | 61.0 | 44262.295082 | http://cuauhtemoc.properati.com.mx/2pu_venta_a... |
| 3 | sell | apartment | |México|Distrito Federal|Cuauhtémoc| | 19.41501,-99.175174 | 6347000.0 | MXN | 6462061.92 | 343571.36 | 176.0 | 128.0 | 49585.937500 | http://cuauhtemoc.properati.com.mx/2pv_venta_a... |
| 6 | sell | apartment | |México|Distrito Federal|Miguel Hidalgo| | 19.456564,-99.191724 | 670000.0 | MXN | 682146.11 | 36267.97 | 65.0 | 65.0 | 10307.692308 | http://miguel-hidalgo-df.properati.com.mx/46h_... |
| 7 | sell | apartment | |México|Distrito Federal|Gustavo A. Madero| | 19.512787,-99.141393 | 1400000.0 | MXN | 1425379.97 | 75783.82 | 82.0 | 70.0 | 20000.000000 | http://gustavo-a-madero.properati.com.mx/46p_v... |
| 8 | sell | house | |México|Distrito Federal|Álvaro Obregón| | 19.358776,-99.213557 | 6680000.0 | MXN | 6801098.67 | 361597.08 | 346.0 | 346.0 | 19306.358382 | http://alvaro-obregon.properati.com.mx/46t_ven... |
xxxxxxxxxxAfter cleaning the data, we can use plotly express to draw scatter plots by specifying the DataFrame and the interested column names.After cleaning the data, we can use plotly express to draw scatter plots by specifying the DataFrame and the interested column names.
import plotly.express as pxfig = px.scatter(mexico_city1, x="price", y="surface_covered_in_m2")fig.show()xxxxxxxxxx<font size="+1">Practice</font> Practice
Plot the scatter plot for column "price" and "surface_total_in_m2".
xxxxxxxxxx# 3D Scatter Plots3D Scatter Plots¶
Scatter plots can summarize information in a DataFrame. Three dimensional scatter plots look great, but be careful: it can be difficult for people who might not be sure what they're looking at to accurately determine values of points in the plot. Still, scatter plots are useful for displaying relationships between three quantities that would be more difficult to observe in a two dimensional plot.
Let's take a look at the first 50 rows of the mexico-city-real-estate-1.csv dataset.
import pandas as pdimport plotly.express as pxmexico_city1 = pd.read_csv("./data/mexico-city-real-estate-1.csv")mexico_city1 = mexico_city1.drop( ["floor", "price_usd_per_m2", "expenses", "rooms"], axis=1)mexico_city1 = mexico_city1.dropna(axis=0)mexico_city1[ ["First Empty", "Country", "City", "Borough", "Second Empty"]] = mexico_city1["place_with_parent_names"].str.split("|", 4, expand=True)mexico_city1 = mexico_city1.drop(["First Empty", "Second Empty"], axis=1)mexico_city1_subset = mexico_city1.loc[1:50]fig = px.scatter_3d( mexico_city1_subset, x="Borough", y="surface_covered_in_m2", z="price", symbol="property_type", color="property_type", labels={ "surface_covered_in_m2": "Surface Covered in m^2", "price": "Price", "property_type": "Property Type", },)fig.show()xxxxxxxxxxNotice that the plot is interactive: you can rotate it zoom in or out. These kinds of plots also makes outliers easier to find; here, we can see that houses have higher prices than other types of properties.Notice that the plot is interactive: you can rotate it zoom in or out. These kinds of plots also makes outliers easier to find; here, we can see that houses have higher prices than other types of properties.
xxxxxxxxxx<font size="+1">Practice</font> Practice
Modify the DataFrame to include columns for the base 10 log of price and surface_covered_in_m2 and then plot these for the entire mexico-city-real-estate-1.csv dataset.
import mathxxxxxxxxxx# Mapbox Scatter PlotsMapbox Scatter Plots¶
xxxxxxxxxxA **mapbox scatter plot** is a special kind of scatter plot that allows you to create scatter plots in two dimensions and then superimpose them on top of a map. Our `mexico-city-real-estate-1.csv` dataset is a good place to start, because it includes **location data**. After importing the dataset and removing rows with missing data, split the `lat-lon` column into two separate columns: one for `latitude` and the other for `longitude`. Then use these to make a mapbox plot. Unfortunately, at present this type of plot does not easily allow for marker shape to vary based on a column of the DataFrame.A mapbox scatter plot is a special kind of scatter plot that allows you to create scatter plots in two dimensions and then superimpose them on top of a map. Our mexico-city-real-estate-1.csv dataset is a good place to start, because it includes location data. After importing the dataset and removing rows with missing data, split the lat-lon column into two separate columns: one for latitude and the other for longitude. Then use these to make a mapbox plot. Unfortunately, at present this type of plot does not easily allow for marker shape to vary based on a column of the DataFrame.
mexico_city1 = pd.read_csv("./data/mexico-city-real-estate-1.csv")mexico_city1 = mexico_city1.drop( ["floor", "price_usd_per_m2", "expenses", "rooms"], axis=1)mexico_city1 = mexico_city1.dropna(axis=0)mexico_city1[["latitude", "longitude"]] = mexico_city1["lat-lon"].str.split( ",", 2, expand=True)mexico_city1["latitude"] = mexico_city1["latitude"].astype(float)mexico_city1["longitude"] = mexico_city1["longitude"].astype(float)fig = px.scatter_mapbox( mexico_city1, lat="latitude", lon="longitude", color="property_type", mapbox_style="carto-positron", labels={"property_type": "Property Type"}, title="Distribution of Property Types for Sale in Mexico City",)fig.show()/tmp/ipykernel_659/3692783844.py:6: FutureWarning: In a future version of pandas all arguments of StringMethods.split except for the argument 'pat' will be keyword-only.
xxxxxxxxxx<font size="+1">Practice</font> Practice
Create another column in the DataFrame with a log scale of the prices. Then create three separate plots, one for stores, another for houses, and a final one for apartments. Color the points in the plots by the log of the price.
from math import log10xxxxxxxxxx# Choropleth MapsChoropleth Maps¶
xxxxxxxxxxA Choropleth Map is a map composed of colored polygons, showing the variable of interest at different color depth across geographies.Plotly express has a function called `px.choropleth` that be used to plot Choropleth Map. The challenges here are getting the geometry information. There are two ways, one is to use the built-in geometries in plotly when plot US States (use the state name directly) and world countries (use ISP-3 code). Another way is to look for GeoJSON files where each location has geometry information. In the following example, we will show the plot in US States with a synthetic data set. A Choropleth Map is a map composed of colored polygons, showing the variable of interest at different color depth across geographies.Plotly express has a function called px.choropleth that be used to plot Choropleth Map. The challenges here are getting the geometry information. There are two ways, one is to use the built-in geometries in plotly when plot US States (use the state name directly) and world countries (use ISP-3 code). Another way is to look for GeoJSON files where each location has geometry information. In the following example, we will show the plot in US States with a synthetic data set.
xxxxxxxxxx# Create Synthetic Datasetdf = pd.DataFrame.from_dict( {"State": ["CA", "TX", "NY", "HI", "DE"], "Temparature": [100, 120, 110, 90, 105]})df| State | Temparature | |
|---|---|---|
| 0 | CA | 100 |
| 1 | TX | 120 |
| 2 | NY | 110 |
| 3 | HI | 90 |
| 4 | DE | 105 |
xxxxxxxxxx# Plot the data set in US mapfig = px.choropleth( df, locations="State", locationmode="USA-states", color="Temparature", scope="usa")fig.show()xxxxxxxxxx# HistogramHistogram¶
xxxxxxxxxxA **histogram** is a graph that shows the frequency distribution of numerical data. In addition to helping us understand frequency, histograms are also useful for detecting outliers. We can use the `px.histogram()` function from Plotly to draw histograms for specific columns, as long as the data type is numerical. Let's check the following example:A histogram is a graph that shows the frequency distribution of numerical data. In addition to helping us understand frequency, histograms are also useful for detecting outliers. We can use the px.histogram() function from Plotly to draw histograms for specific columns, as long as the data type is numerical. Let's check the following example:
xxxxxxxxxximport plotly.express as pxdf = pd.read_csv("data/mexico-city-real-estate-1.csv")fig = px.histogram(df, x="price")fig.show()xxxxxxxxxx<font size="+1">Practice</font> Practice
Check the "surface_covered_in_m2" Histogram.
xxxxxxxxxx# BoxplotsBoxplots¶
xxxxxxxxxxA **boxplot** is a graph that shows the minimum, first quartile, median, third quartile, and the maximum values in a dataset. Boxplots are useful because they provide a visual summary of the data, enabling researchers to quickly identify mean values, the dispersion of the data set, and signs of skewness. In the following example, we will explore how to draw boxplots for specific columns of a DataFrame.A boxplot is a graph that shows the minimum, first quartile, median, third quartile, and the maximum values in a dataset. Boxplots are useful because they provide a visual summary of the data, enabling researchers to quickly identify mean values, the dispersion of the data set, and signs of skewness. In the following example, we will explore how to draw boxplots for specific columns of a DataFrame.
xxxxxxxxxx# Read Datamexico_city1 = pd.read_csv("./data/mexico-city-real-estate-1.csv")# Clean the data and drop `NaNs`mexico_city1 = mexico_city1.drop( ["floor", "price_usd_per_m2", "expenses", "rooms"], axis=1)mexico_city1 = mexico_city1.dropna(axis=0)# Exclude some outliersmexico_city1 = mexico_city1[mexico_city1["price"] < 100000000]mexico_city1.head()| operation | property_type | place_with_parent_names | lat-lon | price | currency | price_aprox_local_currency | price_aprox_usd | surface_total_in_m2 | surface_covered_in_m2 | price_per_m2 | properati_url | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | sell | apartment | |México|Distrito Federal|Cuauhtémoc| | 19.41501,-99.175174 | 2700000.0 | MXN | 2748947.10 | 146154.51 | 61.0 | 61.0 | 44262.295082 | http://cuauhtemoc.properati.com.mx/2pu_venta_a... |
| 3 | sell | apartment | |México|Distrito Federal|Cuauhtémoc| | 19.41501,-99.175174 | 6347000.0 | MXN | 6462061.92 | 343571.36 | 176.0 | 128.0 | 49585.937500 | http://cuauhtemoc.properati.com.mx/2pv_venta_a... |
| 6 | sell | apartment | |México|Distrito Federal|Miguel Hidalgo| | 19.456564,-99.191724 | 670000.0 | MXN | 682146.11 | 36267.97 | 65.0 | 65.0 | 10307.692308 | http://miguel-hidalgo-df.properati.com.mx/46h_... |
| 7 | sell | apartment | |México|Distrito Federal|Gustavo A. Madero| | 19.512787,-99.141393 | 1400000.0 | MXN | 1425379.97 | 75783.82 | 82.0 | 70.0 | 20000.000000 | http://gustavo-a-madero.properati.com.mx/46p_v... |
| 8 | sell | house | |México|Distrito Federal|Álvaro Obregón| | 19.358776,-99.213557 | 6680000.0 | MXN | 6801098.67 | 361597.08 | 346.0 | 346.0 | 19306.358382 | http://alvaro-obregon.properati.com.mx/46t_ven... |
xxxxxxxxxxCheck the boxplot for column `"price"`:Check the boxplot for column "price":
xxxxxxxxxxIf you want to check the distribution of a column value by different categories, defined by another categorical column, you can add an `x` argument to specify the name of the categorical column. In the following example, we check the price distribution across different property types:If you want to check the distribution of a column value by different categories, defined by another categorical column, you can add an x argument to specify the name of the categorical column. In the following example, we check the price distribution across different property types:
xxxxxxxxxx<font size="+1">Practice</font> Practice
Check the "surface_covered_in_m2" distribution by property types.
xxxxxxxxxxfig = ...fig.show()--------------------------------------------------------------------------- AttributeError Traceback (most recent call last) Cell In [15], line 2 1 fig = ... ----> 2 fig.show() AttributeError: 'ellipsis' object has no attribute 'show'
xxxxxxxxxx# Bar ChartBar Chart¶
xxxxxxxxxxA **bar chart** is a graph that shows all the values of a categorical variable in a dataset. They consist of an axis and a series of labeled horizontal or vertical bars. The bars depict frequencies of different values of a variable or simply the different values themselves. The numbers on the y-axis of a vertical bar chart or the x-axis of a horizontal bar chart are called the scale. A bar chart is a graph that shows all the values of a categorical variable in a dataset. They consist of an axis and a series of labeled horizontal or vertical bars. The bars depict frequencies of different values of a variable or simply the different values themselves. The numbers on the y-axis of a vertical bar chart or the x-axis of a horizontal bar chart are called the scale.
In the following example, we will see some bar plots based on the Mexico City real estate dataset. Specifically, we will count the number of observations in each borough and plot them. We first need to read the data set and extract Borough and other location information from column "place_with_parent_names".
xxxxxxxxxx# Read Datamexico_city1 = pd.read_csv("./data/mexico-city-real-estate-1.csv")# Clean the data and drop `NaNs`mexico_city1 = mexico_city1.drop( ["floor", "price_usd_per_m2", "expenses", "rooms"], axis=1)# find location columns from place_with_parent_namesmexico_city1[ ["First Empty", "Country", "City", "Borough", "Second Empty"]] = mexico_city1["place_with_parent_names"].str.split("|", 4, expand=True)mexico_city1 = mexico_city1.drop(["First Empty", "Second Empty"], axis=1)mexico_city1 = mexico_city1.dropna(axis=0)# Exclude some outliersmexico_city1 = mexico_city1[mexico_city1["price"] < 100000000]mexico_city1 = mexico_city1[mexico_city1["Borough"] != ""]mexico_city1.head()xxxxxxxxxxWe can calculate the number of real estate showing in the data set by Borough using `value_counts()`, then plot it as bar plot:We can calculate the number of real estate showing in the data set by Borough using value_counts(), then plot it as bar plot:
xxxxxxxxxx# Use value_counts() to get the datamexico_city1["Borough"].value_counts()xxxxxxxxxx# Plot value_counts() datafig = px.bar(mexico_city1["Borough"].value_counts())fig.show()xxxxxxxxxxWe can plot more expressive bar plots by adding more arguments. For example, we can plot the number of observations by borough and property type. First of all, we need use `groupby` to calculate the aggregated counts for each Borough and property type combination:We can plot more expressive bar plots by adding more arguments. For example, we can plot the number of observations by borough and property type. First of all, we need use groupby to calculate the aggregated counts for each Borough and property type combination:
xxxxxxxxxxsize_df = mexico_city1.groupby(["Borough", "property_type"], as_index=False).size()size_df.head()xxxxxxxxxxBy specifying `x`, `y` and `color`, the following bar graph shows the total counts by Borough, with different property types showing in different colors. Note `y` has to be numerical, while `x` and `color` are usually categorical variables.<span style='color: transparent; font-size:1%'>WQU WorldQuant University Applied Data Science Lab QQQQ</span>By specifying x, y and color, the following bar graph shows the total counts by Borough, with different property types showing in different colors. Note y has to be numerical, while x and color are usually categorical variables.WQU WorldQuant University Applied Data Science Lab QQQQ
xxxxxxxxxxfig = px.bar(size_df, x="Borough", y="size", color="property_type", barmode="relative")fig.show()xxxxxxxxxxNote the argument `barmode` is specified as 'relative', which is also the default value. In this mode, bars are stacked above each other. We can also use 'overlay' where bars are drawn on top of each other.Note the argument barmode is specified as 'relative', which is also the default value. In this mode, bars are stacked above each other. We can also use 'overlay' where bars are drawn on top of each other.
xxxxxxxxxxfig = px.bar(size_df, x="Borough", y="size", color="property_type", barmode="overlay")fig.show()xxxxxxxxxxIf we want bars to be placed beside each other, we can specify `barmode` as "group":If we want bars to be placed beside each other, we can specify barmode as "group":
xxxxxxxxxxfig = px.bar(size_df, x="Borough", y="size", color="property_type", barmode="group")fig.show()xxxxxxxxxx<font size="+1">Practice</font> Practice
Plot bar plot for the number of observations by property types in "mexico-city-real-estate-1.csv".
xxxxxxxxxxbar_df = ...fig = ...fig.show()xxxxxxxxxx# References and Further ReadingReferences and Further Reading¶
- Official plotly express Documentation on Scatter Plots
- Official plotly Express Documentation on 3D Plots
- Official plotly Documentation on Notebooks
- plotly Community Forum Post on Axis Labeling
- plotly express Official Documentation on Tile Maps
- plotly Choropleth Maps in Python Document
- plotly express Official Documentation on Figure Display
- Online Tutorial on String Conversion in Pandas
- Official Pandas Documentation on using Lambda Functions on a Column
- Official Seaborn Documentation on Generating a Heatmap
- Online Tutorial on Correlation Matrices in Pandas
- Official Pandas Documentation on Correlation Matrices
- Official Matplotlib Documentation on Colormaps
- Official Pandas Documentation on Box Plots
- Online Tutorial on Box Plots
- Online Tutorial on Axes Labels in Seaborn and Matplotlib
- Matplotlib Gallery Example of an Annotated Heatmap
xxxxxxxxxx---Copyright 2022 WorldQuant University. This content is licensed solely for personal use. Redistribution or publication of this material is strictly prohibited.
xxxxxxxxxxUsage Guidelines
This lesson is part of the DS Lab core curriculum. For that reason, this notebook can only be used on your WQU virtual machine.
This means:
- ⓧ No downloading this notebook.
- ⓧ No re-sharing of this notebook with friends or colleagues.
- ⓧ No downloading the embedded videos in this notebook.
- ⓧ No re-sharing embedded videos with friends or colleagues.
- ⓧ No adding this notebook to public or private repositories.
- ⓧ No uploading this notebook (or screenshots of it) to other websites, including websites for study resources.
xxxxxxxxxx<font size="+3"><strong>Visualizing Data: seaborn</strong></font>Visualizing Data: seaborn
xxxxxxxxxxThere are many ways to interact with data, and one of the most powerful modes of interaction is through **visualizations**. Visualizations show data graphically, and are useful for exploring, analyzing, and presenting datasets. We use four libraries for making visualizations: [pandas](../%40textbook/07-visualization-pandas.ipynb), [Matplotlib](../%40textbook/06-visualization-matplotlib.ipynb), [plotly express](../%40textbook/08-visualization-plotly.ipynb), and seaborn. In this section, we'll focus on using seaborn.There are many ways to interact with data, and one of the most powerful modes of interaction is through visualizations. Visualizations show data graphically, and are useful for exploring, analyzing, and presenting datasets. We use four libraries for making visualizations: pandas, Matplotlib, plotly express, and seaborn. In this section, we'll focus on using seaborn.
xxxxxxxxxx# Scatter PlotsScatter Plots¶
xxxxxxxxxxA **scatter plot** is a graph that uses dots to represent values for two different numeric variables. The position of each dot on the horizontal and vertical axis indicates values for an individual data point. Scatter plots are used to observe relationships between variables, and are especially useful if you're looking for **correlations**. A scatter plot is a graph that uses dots to represent values for two different numeric variables. The position of each dot on the horizontal and vertical axis indicates values for an individual data point. Scatter plots are used to observe relationships between variables, and are especially useful if you're looking for correlations.
In the following example, we will see some scatter plots based on the Mexico City real estate data. Specifically, we can use scatter plot to show how "price" and "surface_covered_in_m2" are correlated. First we need to read the data set and do a little cleaning.
import pandas as pdimport seaborn as sns# Read Datamexico_city1 = pd.read_csv("./data/mexico-city-real-estate-1.csv")# Clean the data and drop `NaNs`mexico_city1 = mexico_city1.drop( ["floor", "price_usd_per_m2", "expenses", "rooms"], axis=1)mexico_city1 = mexico_city1.dropna(axis=0)# Exclude some outliersmexico_city1 = mexico_city1[mexico_city1["price"] < 100000000]mexico_city1.head()| operation | property_type | place_with_parent_names | lat-lon | price | currency | price_aprox_local_currency | price_aprox_usd | surface_total_in_m2 | surface_covered_in_m2 | price_per_m2 | properati_url | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | sell | apartment | |México|Distrito Federal|Cuauhtémoc| | 19.41501,-99.175174 | 2700000.0 | MXN | 2748947.10 | 146154.51 | 61.0 | 61.0 | 44262.295082 | http://cuauhtemoc.properati.com.mx/2pu_venta_a... |
| 3 | sell | apartment | |México|Distrito Federal|Cuauhtémoc| | 19.41501,-99.175174 | 6347000.0 | MXN | 6462061.92 | 343571.36 | 176.0 | 128.0 | 49585.937500 | http://cuauhtemoc.properati.com.mx/2pv_venta_a... |
| 6 | sell | apartment | |México|Distrito Federal|Miguel Hidalgo| | 19.456564,-99.191724 | 670000.0 | MXN | 682146.11 | 36267.97 | 65.0 | 65.0 | 10307.692308 | http://miguel-hidalgo-df.properati.com.mx/46h_... |
| 7 | sell | apartment | |México|Distrito Federal|Gustavo A. Madero| | 19.512787,-99.141393 | 1400000.0 | MXN | 1425379.97 | 75783.82 | 82.0 | 70.0 | 20000.000000 | http://gustavo-a-madero.properati.com.mx/46p_v... |
| 8 | sell | house | |México|Distrito Federal|Álvaro Obregón| | 19.358776,-99.213557 | 6680000.0 | MXN | 6801098.67 | 361597.08 | 346.0 | 346.0 | 19306.358382 | http://alvaro-obregon.properati.com.mx/46t_ven... |
xxxxxxxxxxUse seaborn to plot the scatter plot for `"price"` and `"surface_covered_in_m2"`:Use seaborn to plot the scatter plot for "price" and "surface_covered_in_m2":
sns.scatterplot(data=mexico_city1, x="price", y="surface_covered_in_m2");xxxxxxxxxxThere is a very useful argument in `scatterplot` called `hue`. By specifying a categorical column as `hue`, seaborn can create a scatter plot between two variables in different categories with different colors. Let's check the following example using `"property_type"`:There is a very useful argument in scatterplot called hue. By specifying a categorical column as hue, seaborn can create a scatter plot between two variables in different categories with different colors. Let's check the following example using "property_type":
sns.scatterplot( data=mexico_city1, x="price", y="surface_covered_in_m2", hue="property_type");xxxxxxxxxx<font size="+1">Practice</font>Practice
Plot a scatter plot for "price" and "surface_total_in_m2" by "property_type" for "mexico-city-real-estate-1.csv":
xxxxxxxxxx# Bar ChartsBar Charts¶
xxxxxxxxxxA **bar chart** is a graph that shows all the values of a categorical variable in a dataset. They consist of an axis and a series of labeled horizontal or vertical bars. The bars depict frequencies of different values of a variable or simply the different values themselves. The numbers on the y-axis of a vertical bar chart or the x-axis of a horizontal bar chart are called the scale. A bar chart is a graph that shows all the values of a categorical variable in a dataset. They consist of an axis and a series of labeled horizontal or vertical bars. The bars depict frequencies of different values of a variable or simply the different values themselves. The numbers on the y-axis of a vertical bar chart or the x-axis of a horizontal bar chart are called the scale.
In the following example, we will see some bar plots based on the Mexico City real estate dataset. Specifically, we will count the number of observations in each borough and plot them. We first need to import the dataset and extract the borough and other location information from column "place_with_parent_names".
xxxxxxxxxx# Read Datamexico_city1 = pd.read_csv("./data/mexico-city-real-estate-1.csv")# Clean the data and drop `NaNs`mexico_city1 = mexico_city1.drop( ["floor", "price_usd_per_m2", "expenses", "rooms"], axis=1)# find location columns from place_with_parent_namesmexico_city1[ ["First Empty", "Country", "City", "Borough", "Second Empty"]] = mexico_city1["place_with_parent_names"].str.split("|", 4, expand=True)mexico_city1 = mexico_city1.drop(["First Empty", "Second Empty"], axis=1)mexico_city1 = mexico_city1.dropna(axis=0)# Exclude some outliersmexico_city1 = mexico_city1[mexico_city1["price"] < 100000000]mexico_city1 = mexico_city1[mexico_city1["Borough"] != ""]mexico_city1.head()/tmp/ipykernel_721/836102575.py:12: FutureWarning: In a future version of pandas all arguments of StringMethods.split except for the argument 'pat' will be keyword-only.
] = mexico_city1["place_with_parent_names"].str.split("|", 4, expand=True)
| operation | property_type | place_with_parent_names | lat-lon | price | currency | price_aprox_local_currency | price_aprox_usd | surface_total_in_m2 | surface_covered_in_m2 | price_per_m2 | properati_url | Country | City | Borough | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | sell | apartment | |México|Distrito Federal|Cuauhtémoc| | 19.41501,-99.175174 | 2700000.0 | MXN | 2748947.10 | 146154.51 | 61.0 | 61.0 | 44262.295082 | http://cuauhtemoc.properati.com.mx/2pu_venta_a... | México | Distrito Federal | Cuauhtémoc |
| 3 | sell | apartment | |México|Distrito Federal|Cuauhtémoc| | 19.41501,-99.175174 | 6347000.0 | MXN | 6462061.92 | 343571.36 | 176.0 | 128.0 | 49585.937500 | http://cuauhtemoc.properati.com.mx/2pv_venta_a... | México | Distrito Federal | Cuauhtémoc |
| 6 | sell | apartment | |México|Distrito Federal|Miguel Hidalgo| | 19.456564,-99.191724 | 670000.0 | MXN | 682146.11 | 36267.97 | 65.0 | 65.0 | 10307.692308 | http://miguel-hidalgo-df.properati.com.mx/46h_... | México | Distrito Federal | Miguel Hidalgo |
| 7 | sell | apartment | |México|Distrito Federal|Gustavo A. Madero| | 19.512787,-99.141393 | 1400000.0 | MXN | 1425379.97 | 75783.82 | 82.0 | 70.0 | 20000.000000 | http://gustavo-a-madero.properati.com.mx/46p_v... | México | Distrito Federal | Gustavo A. Madero |
| 8 | sell | house | |México|Distrito Federal|Álvaro Obregón| | 19.358776,-99.213557 | 6680000.0 | MXN | 6801098.67 | 361597.08 | 346.0 | 346.0 | 19306.358382 | http://alvaro-obregon.properati.com.mx/46t_ven... | México | Distrito Federal | Álvaro Obregón |
xxxxxxxxxxLet's check the example of a bar plot showing the value counts of each borough in the dataset. We first need to create a DataFrame showing the value counts:Let's check the example of a bar plot showing the value counts of each borough in the dataset. We first need to create a DataFrame showing the value counts:
bar_df = pd.DataFrame(mexico_city1["Borough"].value_counts()).reset_index()bar_df| index | Borough | |
|---|---|---|
| 0 | Miguel Hidalgo | 345 |
| 1 | Cuajimalpa de Morelos | 255 |
| 2 | Álvaro Obregón | 203 |
| 3 | Benito Juárez | 198 |
| 4 | Tlalpan | 171 |
| 5 | Iztapalapa | 134 |
| 6 | Tláhuac | 125 |
| 7 | Cuauhtémoc | 120 |
| 8 | Gustavo A. Madero | 89 |
| 9 | Venustiano Carranza | 81 |
| 10 | Coyoacán | 80 |
| 11 | La Magdalena Contreras | 41 |
| 12 | Xochimilco | 34 |
| 13 | Iztacalco | 27 |
| 14 | Azcapotzalco | 24 |
| 15 | Milpa Alta | 1 |
xxxxxxxxxxSince there are 16 different categories in Borough, we should increase the default plot size and rotate the x axis to make the plot more readable using the following syntax:Since there are 16 different categories in Borough, we should increase the default plot size and rotate the x axis to make the plot more readable using the following syntax:
# Increase plot sizesns.set(rc={"figure.figsize": (15, 4)})# Plot the bar plotax = sns.barplot(data=bar_df, x="index", y="Borough")# Rotate the x axisax.set_xticklabels(ax.get_xticklabels(), rotation=75)[Text(0, 0, 'Miguel Hidalgo'), Text(1, 0, 'Cuajimalpa de Morelos'), Text(2, 0, 'Álvaro Obregón'), Text(3, 0, 'Benito Juárez'), Text(4, 0, 'Tlalpan'), Text(5, 0, 'Iztapalapa'), Text(6, 0, 'Tláhuac'), Text(7, 0, 'Cuauhtémoc'), Text(8, 0, 'Gustavo A. Madero'), Text(9, 0, 'Venustiano Carranza'), Text(10, 0, 'Coyoacán'), Text(11, 0, 'La Magdalena Contreras'), Text(12, 0, 'Xochimilco'), Text(13, 0, 'Iztacalco'), Text(14, 0, 'Azcapotzalco'), Text(15, 0, 'Milpa Alta')]
xxxxxxxxxx<font size="+1">Practice</font>Practice
Plot a bar plot showing the value counts for property types in "mexico-city-real-estate-1.csv":
xxxxxxxxxxpro_typ_df = pd.DataFrame(mexico_city1["property_type"].value_counts()).reset_index()pro_typ_dfsns.barplot(data =pro_typ_df,x="index",y="property_type")<AxesSubplot:xlabel='index', ylabel='property_type'>
xxxxxxxxxx# Correlation HeatmapsCorrelation Heatmaps¶
A correlation heatmap shows the relative strength of correlations between the variables in a dataset. Here's what the code looks like:
xxxxxxxxxximport pandas as pdimport seaborn as snsmexico_city1 = pd.read_csv("./data/mexico-city-real-estate-1.csv")mexico_city1 = mexico_city1.drop( ["floor", "price_usd_per_m2", "expenses", "rooms"], axis=1)mexico_city1 = mexico_city1.dropna(axis=0)mexico_city1_numeric = mexico_city1.select_dtypes(include="number")corr = mexico_city1_numeric.corr(method="kendall")sns.heatmap(corr)<AxesSubplot:>
xxxxxxxxxxNotice that we dropped the columns and rows with missing entries before plotting the graph.Notice that we dropped the columns and rows with missing entries before plotting the graph.
This heatmap is showing us what we might already have suspected: the price is moderately positively correlated with the size of the properties.
xxxxxxxxxx<font size="+1">Practice</font>Practice
The seaborn documentation on heat maps indicates how to add numeric labels to each cell and how to use a different colormap. Modify the plot to use the viridis colormap, have a linewidth of 0.5 between each cell and have numeric labels for each cell.
xxxxxxxxxxxxxxxxxxxx# References and Further ReadingReferences and Further Reading¶
- Official Plotly Express Documentation on Scatter Plots
- Official Plotly Express Documentation on 3D Plots
- Official Plotly Documentation on Notebooks
- Plotly Community Forum Post on Axis Labeling
- Plotly Express Official Documentation on Tile Maps
- Plotly Express Official Documentation on Figure Display
- Online Tutorial on String Conversion in Pandas
- Official Pandas Documentation on using Lambda Functions on a Column
- Official seaborn Documentation on Generating a Heatmap
- Online Tutorial on Correlation Matrices in Pandas
- Official Pandas Documentation on Correlation Matrices
- Official Matplotlib Documentation on Colormaps
- Official Pandas Documentation on Box Plots
- Online Tutorial on Box Plots
- Online Tutorial on Axes Labels in seaborn and Matplotlib
- Matplotlib Gallery Example of an Annotated Heatmap
xxxxxxxxxx---Copyright 2022 WorldQuant University. This content is licensed solely for personal use. Redistribution or publication of this material is strictly prohibited. WQU WorldQuant University Applied Data Science Lab QQQQ
xxxxxxxxxx---Copyright 2022 WorldQuant University. This content is licensed solely for personal use. Redistribution or publication of this material is strictly prohibited.
xxxxxxxxxx<font size="+3"><strong>Visualizing Data: seaborn</strong></font>xxxxxxxxxxxxxxxxxxxx-
Variables
Callstack
Breakpoints
Source
xxxxxxxxxx- Scatter Plots
- Bar Charts
- Correlation Heatmaps
- References and Further Reading